print("This is R code!") # click the play button (or CTRL/CMD + Enter) to run it![1] "This is R code!"
2024 MMI microbiome morning
This is a Quarto “code notebook” - which contains a mixture of normal text and R code.
print("This is R code!") # click the play button (or CTRL/CMD + Enter) to run it![1] "This is R code!"
When you run the code, it shows the output underneath.
We’re going to look at an example stool microbiota dataset, from the Papa et al. 2012 study on paediatric IBD patients (Crohn’s and Ulcerative Colitis) and controls.
These data have already been processed into a table: counts of how often each sequence occurs in each sample. It started as huge “fastq” text files, full of As, Cs, Ts and Gs, but we will not practice sequence data processing today, because it takes quite a long time to run.
At MMI, we run the latest processing approaches on our computational resource, Abacus HPC cluster.
We do amplicon sequencing with Illumina MiSeq, or similar technologies, and denoise the output with DADA2 to produce ASV count tables.
The example data we will use today are older. The amplicons were sequenced with “454 pyrosequencing” and clustered into OTUs, but the core principles of statistical analysis are the same for both data types.
I have written some R code. You just need to run each step in order. Try to follow what each line of code is doing. Ask questions if something is unclear, or if you want to know more.
First we will load some useful R packages
Then we will read the data files we need to analyse, and inspect them
Then we’ll compare the microbiota of cases and controls using two types of analysis, and graphs:
here() starts at /Users/david/Documents/git/R-projects/teaching/workshops/2024-MMI-microbiome-morning
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggh4x) # "ggh4x" helps with customising plots
library(ggstatsplot) # "ggstatsplot" provides tools for plotting statistical test resultsYou can cite this package as:
Patil, I. (2021). Visualizations with statistical details: The 'ggstatsplot' approach.
Journal of Open Source Software, 6(61), 3167, doi:10.21105/joss.03167
library(phyloseq) # "phyloseq" provides basic tools for microbiome data science
library(microViz) # "microViz" provides tools for microbiome visualisation and analysismicroViz version 0.12.5 - Copyright (C) 2021-2024 David Barnett
! Website: https://david-barnett.github.io/microViz
✔ Useful? For citation details, run: `citation("microViz")`
✖ Silence? `suppressPackageStartupMessages(library(microViz))`
The data about the patients’ age, sex, etc. are stored in a file called "all_metadata.rds"
.RDS indicates that this file is in a special “R data storage” format.
We first need to read this file into our active R session:
The data is now stored in a table-shaped object called meta.
To inspect this data, type View(meta) into the R Console (at the bottom of your screen), and press enter.
Next, we will read in a table of microbial counts (i.e. the processed sequencing data): this is stored as a TSV (tab-separated variables) formatted text file.
Rows: 91 Columns: 36350
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): sample
dbl (36349): OTU_00001, OTU_00002, OTU_00003, OTU_00004, OTU_00005, OTU_0000...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Question: how can we inspect this data?
Next we will read the taxonomy table stored in "papa2012_taxonomy_table.txt"
Rows: 36349 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (7): OTU, Kingdom, Phylum, Class, Order, Family, Genus
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Note: it doesn’t actually matter what name we give to the data objects, but a good name helps make the code readable.
x <- meta # This creates a copy of meta, called x. For a simple start, let’s plot a histogram of OTU number 1.
counts %>% ggplot(aes(OTU_00001)) + geom_histogram(bins = 50)Looks like there are a lot of zeros! This code counts how many.
table(OTU00001_has_0_counts = counts$OTU_00001 == 0, useNA = "ifany")OTU00001_has_0_counts
FALSE TRUE
59 32
So far we have not used any R packages specialised for microbiome data.
Let’s do so, because it will make our next tasks a lot easier!
We will start by combining our three datasets into one phyloseq object
The code for this bit can be quite tricky. So don’t worry, you can just run it.
This code combines the count table and taxonomy table.
# `phyloseq` uses row or column names to match OTUs across count and taxonomy tables.
# In addition, the taxonomy table must contain taxonomic ranks in descending order,
# and the count table must be converted to a numeric matrix.
# these lines convert the "dataframe" of counts into a matrix
count_matrix <- counts %>% select(where(is.numeric)) %>% as.matrix()
rownames(count_matrix) <- counts$sample
# these lines convert the "dataframe" of taxonomy into a matrix
tax_matrix <- taxonomy %>% select(!OTU) %>% as.matrix()
rownames(tax_matrix) <- taxonomy$OTU
# these lines start making the phyloseq object
ps <- phyloseq(
otu_table(count_matrix, taxa_are_rows = FALSE),
tax_table(tax_matrix)
)Look what we have made so far.
psphyloseq-class experiment-level object
otu_table() OTU Table: [ 36349 taxa and 91 samples ]
tax_table() Taxonomy Table: [ 36349 taxa by 6 taxonomic ranks ]
Next we reformat the the metadata table so we can add it to our phyloseq
# phyloseq uses row names to match samples (across the OTU count table and the sample metadata).
# in the phyloseq object the sample names have underscores, but in the meta object they have hyphens
# We can fix that with the `str_replace` function.
meta$sample <- meta$sample %>% str_replace(pattern = "-", replacement = "_")
# Now make them row names, and check they all match the phyloseq sample names
meta_df <- as.data.frame(meta)
rownames(meta_df) <- meta$sample
all(rownames(meta_df) %in% sample_names(ps))[1] TRUE
Add the metadata to the phyloseq, and look at the phyloseq.
# this adds the metadata to the phyloseq object
sample_data(ps) <- meta_df
# this prints the updated phyloseq object "ps"
psphyloseq-class experiment-level object
otu_table() OTU Table: [ 36349 taxa and 90 samples ]
sample_data() Sample Data: [ 90 samples by 17 sample variables ]
tax_table() Taxonomy Table: [ 36349 taxa by 6 taxonomic ranks ]
microViz provides tools for microbiome data analysis, using phyloseq objects.
I developed the microViz package during my PhD. Many other microbiome researchers now also use it.
Let’s first take a look at the tables stored inside the phyloseq object.
samdat_tbl(ps) # retrieve sample data as a tibble# A tibble: 90 × 18
.sample_name ID sample case_control diagnosis activity active gender
<chr> <chr> <chr> <chr> <fct> <fct> <fct> <chr>
1 099_AX 099A 099_AX Case UC severe active female
2 199_AX 199A 199_AX Control Other control control female
3 062_BZ 062B 062_BZ Case CD mild active male
4 194_AZ 194A 194_AZ Case UC mild active male
5 166_AX 166A 166_AX Case UC severe active female
6 219_AX 219A 219_AX Case UC inactive inactive female
7 132_AX 132A 132_AX Case CD mild active female
8 026_AX 026A 026_AX Case UC mild active male
9 102_AZ 102A 102_AZ Control Other control control male
10 140_AX 140A 140_AX Control Other control control female
# ℹ 80 more rows
# ℹ 10 more variables: ethnicity <chr>, family_history <chr>, age_years <dbl>,
# immunosuppression_level <fct>, medication <chr>, ibd_fhx <lgl>, abx <lgl>,
# steroids <lgl>, imsp_other <lgl>, imsp_any <lgl>
# get a small part of the OTU table
otu_get(ps, taxa = 1:5, samples = c("132_AX", "166_AX", "102_AZ"))OTU Table: [5 taxa and 3 samples]
taxa are columns
OTU_00001 OTU_00002 OTU_00003 OTU_00004 OTU_00005
132_AX 121 0 7 0 0
166_AX 138 3 12 0 13
102_AZ 0 0 0 0 0
Taxonomy Table: [3 taxa by 6 taxonomic ranks]:
Kingdom Phylum Class Order
OTU_00001 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales"
OTU_00002 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Enterobacteriales"
OTU_00003 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales"
Family Genus
OTU_00001 "Ruminococcaceae" "Faecalibacterium"
OTU_00002 "Enterobacteriaceae" "Escherichia/Shigella"
OTU_00003 "Lachnospiraceae" "Blautia"
The taxonomy table contains some missing values e.g. when the genus classification is not known.
microViz can fix those gaps by using the next highest level of classification that is available, e.g. Family.
ps <- tax_fix(ps)Sequencing data are compositional. The total number of reads per sample is mostly arbitrary, and so the counts should be interpreted as relative abundances instead of absolute abundances.
A simple way to visualise compositional data is as percentages, proportions of a whole.
Stacked bar charts are great for this: each bar represents one sample, and we can show the abundance data as proportions, after aggregating the counts by taxonomy.
ps %>% comp_barplot("Phylum", n_taxa = 4, label = NULL)Registered S3 method overwritten by 'seriation':
method from
reorder.hclust vegan
This plot is aggregated into phyla, which is easy to read, and provides basic information.
We see that most samples contain a mixture of Firmicutes and Bacteroidetes, but some appear to be dominated by Proteobacteria instead.
We can visually compare the patient groups in this dataset by “faceting” the samples on the plot.
ps %>%
comp_barplot("Phylum", n_taxa = 4, label = NULL) +
facet_grid(cols = vars(diagnosis), scales = "free") +
force_panelsizes(cols = c(table(ps@sam_data$diagnosis)))It seems that most of the patients who have gut microbiota dominated by Proteobacteria are Ulcerative Colitis patients (UC). Let’s dig deeper into the taxonomic classifications.
ps %>% comp_barplot("Genus", n_taxa = 11, merge_other = F, label = NULL) +
facet_grid(cols = vars(diagnosis), scales = "free") +
force_panelsizes(cols = c(table(ps@sam_data$diagnosis)))This plot is aggregated into genera, which provides more detail, but is harder to read, and we cannot give every genus a distinct colour.
We see that many samples contain a relatively large proportion of Bacteroides or Faecalibacterium but there is quite some further variation! The Proteobacteria-dominated samples seem to contain OTUs classified as Escherichia/Shigella (16S amplicons can’t tell us which) or Klebsiella.
Some taxa could not be classified at genus level. Here we see a group named Lachnospiraceae Family, instead of a genus name.
As a last task for this practical, we will calculate and visualise alpha diversity.
diverse == healthy is not TRUE for all ecosystems (e.g. early infant gut microbiome)ps_calc_richness() computes the index for each sample and adds it to your sample_dataps_alpha <- ps %>% ps_calc_richness(rank = "Genus", index = "observed", varname = "N_genera")Next we will calculate the Shannon diversity index at the level of Genus.
We will also compute an exponentiated version, the effective number of genera, or effective Shannon index.
The exponent of the Shannon index \(e^H\) represents the number of taxa (genera) that would be present in an evenly abundant ecosystem with the same Shannon index.
ps_alpha <- ps_alpha %>%
ps_calc_diversity(index = "shannon", rank = "Genus", varname = "Shannon_Genus") %>%
ps_mutate(Effective_Shannon_Genus = exp(Shannon_Genus))
ps_alphaphyloseq-class experiment-level object
otu_table() OTU Table: [ 36349 taxa and 90 samples ]
sample_data() Sample Data: [ 90 samples by 20 sample variables ]
tax_table() Taxonomy Table: [ 36349 taxa by 6 taxonomic ranks ]
The new genus-level richness and diversity variables are stored in the sample data slot of ps_alpha
samdat_tbl(ps_alpha) %>% select(sample, N_genera, Shannon_Genus, Effective_Shannon_Genus)# A tibble: 90 × 4
sample N_genera Shannon_Genus Effective_Shannon_Genus
<chr> <dbl> <dbl> <dbl>
1 099_AX 22 1.89 6.63
2 199_AX 37 2.71 15.1
3 062_BZ 27 1.66 5.25
4 194_AZ 15 1.52 4.59
5 166_AX 23 1.92 6.81
6 219_AX 18 1.22 3.39
7 132_AX 42 2.35 10.5
8 026_AX 42 2.14 8.47
9 102_AZ 58 2.56 12.9
10 140_AX 18 1.32 3.74
# ℹ 80 more rows
First we can plot basic histograms of the richness and diversity values we observe. Try changing the variable name to make different plots, and notice the different range of values for each.
hist(samdat_tbl(ps_alpha)[["N_genera"]])# make another plot here? copy-paste and edit the variable nameWe can make box plots to compare diversity across groups.
ps_alpha %>%
samdat_tbl() %>%
ggplot(aes(x = Effective_Shannon_Genus, y = diagnosis, color = diagnosis)) +
geom_boxplot(outliers = FALSE) +
geom_jitter(height = 0.2) +
theme_classic()It looks like the Ulcerative Colitis patients might have a lower diversity on average.
But we should check this with statistical testing.
You can use standard statistical testing on the diversity values e.g. linear regression or ANOVA
eShannon_lm <- lm(data = samdat_tbl(ps_alpha), formula = Effective_Shannon_Genus ~ diagnosis)
anova(eShannon_lm)Analysis of Variance Table
Response: Effective_Shannon_Genus
Df Sum Sq Mean Sq F value Pr(>F)
diagnosis 2 158.96 79.482 7.5185 0.0009731 ***
Residuals 87 919.73 10.572
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = eShannon_lm)
$diagnosis
diff lwr upr p adj
CD-Other -1.541564 -3.803828 0.7207006 0.2406931
UC-Other -3.170198 -5.145627 -1.1947686 0.0007080
UC-CD -1.628634 -3.631435 0.3741666 0.1339407
For simple group comparisons like this, the ggstatsplot package works like magic, making plots with detailed statistical reporting added automatically.
ggbetweenstats(
data = samdat_tbl(ps_alpha),
x = diagnosis, y = Effective_Shannon_Genus,
bf.message = FALSE
)This last section is just an exercise to gain a bit more intuition for the richness and diversity measures.
Each sample is sorted and labelled by the observed richness of genera
Can you spot samples with equal richness but clear differences in evenness?
ps_alpha %>%
ps_arrange(N_genera) %>%
comp_barplot(
tax_level = "Genus", n_taxa = 19, merge_other = FALSE,
sample_order = "asis", label = "N_genera"
) +
coord_flip()Now each sample is labelled with the effective Shannon diversity of genera (\(e^H\))
Do you see the general relationship of \(e^H\) with sample composition?
ps_alpha %>%
ps_arrange(Effective_Shannon_Genus) %>%
ps_mutate(short_shannon = formatC(Effective_Shannon_Genus, digits = 1, format = "f")) %>%
comp_barplot(
tax_level = "Genus", n_taxa = 19, merge_other = FALSE,
sample_order = "asis", label = "short_shannon"
) +
coord_flip()We have assembled a phyloseq object and calculated richness and diversity measures.
Let’s store the result of this processing, by writing the phyloseq object to a file.
We can use the saveRDS() function to do this.